{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Language Detection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Polyglot depends on [pycld2](https://pypi.python.org/pypi/pycld2/) library which in turn depends on [cld2](https://code.google.com/p/cld2/) library for detecting language(s) used in plain text." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from polyglot.detect import Detector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "arabic_text = u\"\"\"\n", "أفاد مصدر امني في قيادة عمليات صلاح الدين في العراق بأن \" القوات الامنية تتوقف لليوم\n", "الثالث على التوالي عن التقدم الى داخل مدينة تكريت بسبب\n", "انتشار قناصي التنظيم الذي يطلق على نفسه اسم \"الدولة الاسلامية\" والعبوات الناسفة\n", "والمنازل المفخخة والانتحاريين، فضلا عن ان القوات الامنية تنتظر وصول تعزيزات اضافية \".\n", "\"\"\"\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "name: Arabic code: ar confidence: 99.0 read bytes: 907\n" ] } ], "source": [ "detector = Detector(arabic_text)\n", "print(detector.language)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mixed Text" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mixed_text = u\"\"\"\n", "China (simplified Chinese: 中国; traditional Chinese: 中國),\n", "officially the People's Republic of China (PRC), is a sovereign state located in East Asia.\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the text contains snippets from different languages, the detector is able to find the most probable langauges used in the text.\n", "For each language, we can query the model confidence level:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "name: English code: en confidence: 87.0 read bytes: 1154\n", "name: Chinese code: zh_Hant confidence: 5.0 read bytes: 1755\n", "name: un code: un confidence: 0.0 read bytes: 0\n" ] } ], "source": [ "for language in Detector(mixed_text).languages:\n", " print(language)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To take a closer look, we can inspect the text line by line, notice that the confidence in the detection went down for the first line" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "China (simplified Chinese: 中国; traditional Chinese: 中國),\n", "\n", "name: English code: en confidence: 71.0 read bytes: 887\n", "name: Chinese code: zh_Hant confidence: 11.0 read bytes: 1755\n", "name: un code: un confidence: 0.0 read bytes: 0\n", "\n", "\n", "officially the People's Republic of China (PRC), is a sovereign state located in East Asia.\n", "\n", "name: English code: en confidence: 98.0 read bytes: 1291\n", "name: un code: un confidence: 0.0 read bytes: 0\n", "name: un code: un confidence: 0.0 read bytes: 0\n", "\n", "\n" ] } ], "source": [ "for line in mixed_text.strip().splitlines():\n", " print(line + u\"\\n\")\n", " for language in Detector(line).languages:\n", " print(language)\n", " print(\"\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"## Best Effort Strategy\n", "\n", "Sometimes, there is no enough text to make a decision, like detecting a language from one word.\n", "This forces the detector to switch to a best effort strategy, a warning will be thrown and the attribute `reliable` will be set to `False`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:polyglot.detect.base:Detector is not able to detect the language reliably.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Prediction is reliable: False\n", "Language 1: name: English code: en confidence: 85.0 read bytes: 1194\n", "Language 2: name: un code: un confidence: 0.0 read bytes: 0\n", "Language 3: name: un code: un confidence: 0.0 read bytes: 0\n" ] } ], "source": [ "detector = Detector(\"pizza\")\n", "print(detector)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In case, that the detection is not reliable even when we are using the best effort strategy, an exception `UnknownLanguage` will be thrown." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "ename": "UnknownLanguage", "evalue": "Try passing a longer snippet of text", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mUnknownLanguage\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[1;32mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mDetector\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"4\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;32m/usr/local/lib/python2.7/dist-packages/polyglot-15.04.17-py2.7.egg/polyglot/detect/base.pyc\u001b[0m in \u001b[0;36m__init__\u001b[1;34m(self, text, quiet)\u001b[0m\n\u001b[0;32m 63\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mquiet\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mquiet\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 64\u001b[0m \u001b[1;34m\"\"\"If true, exceptions will be silenced.\"\"\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 65\u001b[1;33m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdetect\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtext\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 66\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 67\u001b[0m \u001b[1;33m@\u001b[0m\u001b[0mstaticmethod\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/usr/local/lib/python2.7/dist-packages/polyglot-15.04.17-py2.7.egg/polyglot/detect/base.pyc\u001b[0m in \u001b[0;36mdetect\u001b[1;34m(self, text)\u001b[0m\n\u001b[0;32m 89\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 90\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mreliable\u001b[0m \u001b[1;32mand\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mquiet\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 91\u001b[1;33m \u001b[1;32mraise\u001b[0m \u001b[0mUnknownLanguage\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"Try passing a longer snippet of text\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 92\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 93\u001b[0m 
{ "cell_type": "markdown", "metadata": {}, "source": [ "If the detection is not reliable even with the best effort strategy, an `UnknownLanguage` exception will be raised." ] },
{ "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "ename": "UnknownLanguage", "evalue": "Try passing a longer snippet of text", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mUnknownLanguage\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[1;32mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mDetector\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"4\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;32m/usr/local/lib/python2.7/dist-packages/polyglot-15.04.17-py2.7.egg/polyglot/detect/base.pyc\u001b[0m in \u001b[0;36m__init__\u001b[1;34m(self, text, quiet)\u001b[0m\n\u001b[0;32m 63\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mquiet\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mquiet\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 64\u001b[0m \u001b[1;34m\"\"\"If true, exceptions will be silenced.\"\"\"\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 65\u001b[1;33m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdetect\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtext\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 66\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 67\u001b[0m \u001b[1;33m@\u001b[0m\u001b[0mstaticmethod\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;32m/usr/local/lib/python2.7/dist-packages/polyglot-15.04.17-py2.7.egg/polyglot/detect/base.pyc\u001b[0m in \u001b[0;36mdetect\u001b[1;34m(self, text)\u001b[0m\n\u001b[0;32m 89\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 90\u001b[0m \u001b[1;32mif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mreliable\u001b[0m \u001b[1;32mand\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mquiet\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 91\u001b[1;33m \u001b[1;32mraise\u001b[0m \u001b[0mUnknownLanguage\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"Try passing a longer snippet of text\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 92\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 93\u001b[0m \u001b[0mlogger\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mwarning\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"Detector is not able to detect the language reliably.\"\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mUnknownLanguage\u001b[0m: Try passing a longer snippet of text" ] } ], "source": [ "print(Detector(\"4\"))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Such an exception may not be desirable, especially for trivial cases like single characters that could belong to many languages.\n", "In this case, we can silence the exceptions by setting `quiet` to `True`:" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:polyglot.detect.base:Detector is not able to detect the language reliably.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Prediction is reliable: False\n", "Language 1: name: un code: un confidence: 0.0 read bytes: 0\n", "Language 2: name: un code: un confidence: 0.0 read bytes: 0\n", "Language 3: name: un code: un confidence: 0.0 read bytes: 0\n" ] } ], "source": [ "print(Detector(\"4\", quiet=True))" ] },
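{ "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, the exception can be caught explicitly instead of being silenced. The sketch below assumes that `UnknownLanguage` is importable from `polyglot.detect.base`, the module shown in the traceback above." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A minimal sketch: catch the exception instead of silencing it.\n", "# Assumes UnknownLanguage lives in polyglot.detect.base, as the\n", "# traceback above suggests.\n", "from polyglot.detect.base import UnknownLanguage\n", "\n", "try:\n", "    print(Detector(\"4\"))\n", "except UnknownLanguage as error:\n", "    print(\"Detection failed: \" + str(error))" ] },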
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Command Line" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "usage: polyglot detect [-h] [--input [INPUT [INPUT ...]]]\r\n", "\r\n", "optional arguments:\r\n", " -h, --help show this help message and exit\r\n", " --input [INPUT [INPUT ...]]\r\n" ] } ], "source": [ "!polyglot detect --help" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The subcommand `detect` tries to identify the language code for each line in a text file.\n", "This can be convenient if each line represents a document or a sentence, as might be produced by a tokenizer." ] },
{ "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "English Australia posted a World Cup record total of 417-6 as they beat Afghanistan by 275 runs.\r\n", "English David Warner hit 178 off 133 balls, Steve Smith scored 95 while Glenn Maxwell struck 88 in 39 deliveries in the Pool A encounter in Perth.\r\n", "English Afghanistan were then dismissed for 142, with Mitchell Johnson and Mitchell Starc taking six wickets between them.\r\n", "English Australia's score surpassed the 413-5 India made against Bermuda in 2007.\r\n", "English It continues the pattern of bat dominating ball in this tournament as the third 400 plus score achieved in the pool stages, following South Africa's 408-5 and 411-4 against West Indies and Ireland respectively.\r\n", "English The winning margin beats the 257-run amount by which India beat Bermuda in Port of Spain in 2007, which was equalled five days ago by South Africa in their victory over West Indies in Sydney.\r\n" ] } ], "source": [ "!polyglot detect --input testdata/cricket.txt" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Supported Languages\n", "\n", "cld2 can detect up to 165 languages. The list below contains more entries than that because several languages appear under more than one name or script (Chinese, for example, appears multiple times)." ] },
{ "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
" 1. Abkhazian 2. Afar 3. Afrikaans \n", " 4. Akan 5. Albanian 6. Amharic \n", " 7. Arabic 8. Armenian 9. Assamese \n", " 10. Aymara 11. Azerbaijani 12. Bashkir \n", " 13. Basque 14. Belarusian 15. Bengali \n", " 16. Bihari 17. Bislama 18. Bosnian \n", " 19. Breton 20. Bulgarian 21. Burmese \n", " 22. Catalan 23. Cebuano 24. Cherokee \n", " 25. Nyanja 26. Corsican 27. Croatian \n", " 28. Croatian 29. Czech 30. Chinese \n", " 31. Chinese 32. Chinese 33. Chinese \n", " 34. Chineset 35. Chineset 36. Chineset \n", " 37. Chineset 38. Chineset 39. Chineset \n", " 40. Danish 41. Dhivehi 42. Dutch \n", " 43. Dzongkha 44. English 45. Esperanto \n", " 46. Estonian 47. Ewe 48. Faroese \n", " 49. Fijian 50. Finnish 51. French \n", " 52. Frisian 53. Ga 54. Galician \n", " 55. Ganda 56. Georgian 57. German \n", " 58. Greek 59. Greenlandic 60. Guarani \n", " 61. Gujarati 62. Haitian_creole 63. Hausa \n", " 64. Hawaiian 65. Hebrew 66. Hebrew \n", " 67. Hindi 68. Hmong 69. Hungarian \n", " 70. Icelandic 71. Igbo 72. Indonesian \n", " 73. Interlingua 74. Interlingue 75. Inuktitut \n", " 76. Inupiak 77. Irish 78. Italian \n", " 79. Ignore 80. Javanese 81. Javanese \n", " 82. Japanese 83. Kannada 84. Kashmiri \n", " 85. Kazakh 86. Khasi 87. Khmer \n", " 88. Kinyarwanda 89. Krio 90. Kurdish \n", " 91. Kyrgyz 92. Korean 93. Laothian \n", " 94. Latin 95. Latvian 96. Limbu \n", " 97. Limbu 98. Limbu 99. Lingala \n", "100. Lithuanian 101. Lozi 102. Luba_lulua \n", "103. Luo_kenya_and_tanzania 104. Luxembourgish 105. Macedonian \n", "106. Malagasy 107. Malay 108. Malayalam \n", "109. Maltese 110. Manx 111. Maori \n", "112. Marathi 113. Mauritian_creole 114. Romanian \n", "115. Mongolian 116. Montenegrin 117. Montenegrin \n", "118. Montenegrin 119. Montenegrin 120. Nauru \n", "121. Ndebele 122. Nepali 123. Newari \n", "124. Norwegian 125. Norwegian 126. Norwegian_n \n", "127. Nyanja 128. Occitan 129. Oriya \n", "130. Oromo 131. Ossetian 132. Pampanga \n", "133. Pashto 134. Pedi 135. Persian \n", "136. Polish 137. Portuguese 138. Punjabi \n", "139. Quechua 140. Rajasthani 141. Rhaeto_romance \n", "142. Romanian 143. Rundi 144. Russian \n", "145. Samoan 146. Sango 147. Sanskrit \n", "148. Scots 149. Scots_gaelic 150. Serbian \n", "151. Serbian 152. Seselwa 153. Seselwa \n", "154. Sesotho 155. Shona 156. Sindhi \n", "157. Sinhalese 158. Siswant 159. Slovak \n", "160. Slovenian 161. Somali 162. Spanish \n", "163. Sundanese 164. Swahili 165. Swedish \n", "166. Syriac 167. Tagalog 168. Tajik \n", "169. Tamil 170. Tatar 171. Telugu \n", "172. Thai 173. Tibetan 174. Tigrinya \n", "175. Tonga 176. Tsonga 177. Tswana \n", "178. Tumbuka 179. Turkish 180. Turkmen \n", "181. Twi 182. Uighur 183. Ukrainian \n", "184. Urdu 185. Uzbek 186. Venda \n", "187. Vietnamese 188. Volapuk 189. Waray_philippines \n", "190. Welsh 191. Wolof 192. Xhosa \n", "193. Yiddish 194. Yoruba 195. Zhuang \n", "196. Zulu \n"
] } ], "source": [ "from polyglot.utils import pretty_list\n", "print(pretty_list(Detector.supported_languages()))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }